
feat: Add label polymorphism #4388

Merged
merged 40 commits into from
Apr 28, 2022

Conversation


@Marwes Marwes commented Jan 5, 2022

POC to introduce "label polymorphism", which allows type signatures to specify more precisely how they transform records, so that more errors can be determined statically. This works by giving string literals the new label("literal") type instead of just string, which, in conjunction with record fields being allowed to be type variables, lets functions accept a "label" as a parameter that determines which fields they operate on.

  • Allowing multiple labels to be passed in an array? (columns: [L]) Probably needs to treat arrays as heterogeneous lists of some sort so they are typed like ["a", "b"] instead of [string]. Should be a separate issue, implemented later if desired.
  • On the other hand, functions still want to accept a plain dynamic string (as they do today), so unification needs to cope with that. We may need "dynamic" labels for this, { A with 'dynamic: int, 'dynamic: B }, where a record with one (or more!) dynamic labels would accept any field being accessed. We are moving ahead with only statically defined labels; dynamic ones should be rare enough that we can accept any breakage (though we do need to verify that).
  • Some better syntax for specifying/printing labels so it is clear when they differ. Currently the same single, uppercase letter syntax as for type variables is used, like { A: int }, but that makes it impossible to define a record type with the field A. Perhaps add a "A" syntax (same as the record creation/indexing syntax)? See feat: Accept string literals in the fields of a record type #4664
  • Type variables are currently re-numbered during the convert pass, so you can get confusing errors where it says that the field C does not exist even though the field was actually specified with B (as seen below). (This happens on the master branch today as well.) Should be another issue; we may not bother with this.
  • The delayed unification that is currently there is hacky. It is needed to ensure that we see which label B has before we do the record unification. A better way may be to add a constraint on B, like "B must be one of the fields in record X", which can be solved later.
  • Feature flag so we can release and compile code without labels.
  • Need more tests, or confidence that current tests in labels.rs have adequate coverage

Based on #4664
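The proposal's core mechanism can be illustrated with a toy model (hypothetical Python, not the actual compiler code): a string literal gets a label type carrying its value, while a plain dynamic string stays opaque, so only labels can statically pick out a record field.

```python
# Toy sketch (hypothetical, not the Flux implementation): a literal like "a"
# carries a Label type, so the checker can see which field a call selects.
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    name: str        # the literal, e.g. "a" in fill(column: "a", ...)

@dataclass(frozen=True)
class Str:
    pass             # a plain dynamic string: value unknown at check time

def field_selected_by(arg, record_fields):
    """Return the field an argument statically selects, or None if the
    argument is a plain string and the choice must be deferred to runtime."""
    if isinstance(arg, Label):
        if arg.name not in record_fields:
            raise TypeError(f'record has no field "{arg.name}"')
        return arg.name
    return None

print(field_selected_by(Label("a"), {"a": "int", "b": "string"}))  # a
print(field_selected_by(Str(), {"a": "int", "b": "string"}))       # None
```

With labels, a call that names a nonexistent field fails at check time instead of at runtime, which is exactly the class of errors the proposal wants to catch statically.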


@nathanielc nathanielc left a comment


Overall I like the direction this proposal is going. Have a few questions below.

```
{ 'column: A } <=> { b: string, a: int }
```

There may be a consistent way to unify these records in the face of type variables, however an easy workaround would be to delay the unification of records with unknown fields until they have been resolved, at which point they can unify normally. If there is a field that is still unknown when type checking is done we can designate that as an error.
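The delay-until-resolved strategy can be sketched as a small worklist loop (a hypothetical illustration, not the actual checker): constraints whose label variable is still unknown are set aside, retried while progress is being made, and reported as errors if any remain once checking is done.

```python
# Toy sketch (hypothetical): defer record unification until the label variable
# involved has been resolved; leftover unresolved labels become errors.
def solve(constraints, bindings):
    """constraints: list of (label_var, record_field_set) pairs.
    bindings: label_var -> concrete field name, filled in as inference runs."""
    pending = list(constraints)
    progress = True
    while pending and progress:
        progress = False
        still_pending = []
        for var, fields in pending:
            if var in bindings:                      # label now known: unify
                if bindings[var] not in fields:
                    raise TypeError(f'record has no field "{bindings[var]}"')
                progress = True
            else:
                still_pending.append((var, fields))  # defer for another pass
        pending = still_pending
    if pending:  # fields still unknown when type checking is done
        raise TypeError(f"unresolved labels: {[v for v, _ in pending]}")

solve([("B", {"a", "b"})], {"B": "a"})  # resolves and unifies fine
```

The cost is extra passes over the pending set, and constraints that depend on each other have to be retried until a fixed point.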
Contributor


To me, delaying unification until records with unknown fields have been resolved makes the most sense. It's how I reason about this logic in the first place. Are there any drawbacks to delaying? Multiple passes over the semantic graph?

Contributor Author


It adds some complexity, and could possibly give worse error messages (though as long as we take care to store all the context we need for the errors, it shouldn't be an issue).

There may be some performance implications as well, since we need to know/decide when to evaluate these delayed constraints. Constraints could (in theory at least) depend on each other in complex ways, which we would need to untangle to resolve them all.

A bit of a tangent, but Rust's typechecker has a largely equivalent system of "obligations" which it brute-forces when solving them. I tried to improve that with rust-lang/rust#69218, which would make them know more precisely when they would be solvable, but the overhead was too much.

```
func(opt: "a")

// Possible extension where we only allow some specific labels to be passed in
builtin func : (opt: "option1" | "option2") => int
```
Contributor


Is the idea here to make a sort of enum type? For things like the method parameter to the quantile function?

Why would we use labels for this, other than that it's convenient? Meaning I don't see a use case yet where a label used in the context of a column name would be constrained to a specific set.

Contributor Author


Is the idea here to make a sort of enum type? For things like the method parameter to the quantile function?

Yes, at least a limited one. I mostly just extracted it from #4410 where I found some functions which accept only a limited number of strings as an argument (see group and stddev)
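A `"option1" | "option2"` parameter is effectively a membership check on label literals; a minimal sketch (hypothetical names, not Flux's checker):

```python
# Toy sketch (hypothetical): a parameter typed "option1" | "option2" accepts
# only those label literals and rejects anything else at check time.
ALLOWED = {"option1", "option2"}   # the label union from the signature

def check_opt(label):
    if label not in ALLOWED:
        raise TypeError(f'expected one of {sorted(ALLOWED)}, got "{label}"')
    return label

print(check_opt("option1"))  # option1
```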

```
// We should (probably) still allow dynamic strings for backwards compatibility sake so the `string` type
// also implements the `Label` Kind
c = "a" + ""
fill(column: c, value: 0)
```
Contributor


I wonder how often a truly dynamic column name is used? Maybe we can get some of that from the query archive?

Also, what do we consider dynamic? Anything that requires evaluating the semantic graph? What about this case:

```
c = "foo"
fill(column: c, value: 0)
```

Is that considered dynamic or static? It seems like it could be considered static. I would also expect this case to be very common, as a user might have to use the column name in several different functions and so would want to put it in a variable to make changing it easier.

Contributor Author


Yes, that could be considered static in this context (probably harder to force it to be "dynamic" than not).
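That reading, where a binding to a literal stays static, corresponds to letting the literal's label type flow through the variable; a minimal inference sketch (hypothetical, not the actual checker):

```python
# Toy sketch (hypothetical): a variable bound to a string literal keeps the
# literal's label type, so `c = "foo"; fill(column: c, ...)` remains static.
# Only expressions whose value is unknown (e.g. string concatenation) degrade
# to the plain `string` type.
def infer(expr, env):
    kind, *rest = expr
    if kind == "lit":            # "foo"        -> label("foo")
        return ("label", rest[0])
    if kind == "var":            # c            -> whatever c was bound to
        return env[rest[0]]
    if kind == "concat":         # "a" + suffix -> plain string
        return ("string",)
    raise ValueError(kind)

env = {}
env["c"] = infer(("lit", "foo"), env)   # c = "foo"
print(infer(("var", "c"), env))         # ('label', 'foo') -> static
print(infer(("concat",), env))          # ('string',)      -> dynamic
```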


Marwes commented Mar 17, 2022

For just about all functions defined with polymorphic labels in #4410, I can think of a way to redefine them to not require polymorphic labels, which may be an argument for not doing this.

```
builtin columns : (<-tables: [A], ?column: 'label) => [{ 'label: string }] where A: Record
// into; if we want a different name on the output, that is a `map(fn: fn(r) => { mycolumn: r.value })`
builtin columns : (<-tables: [A]) => [{ value: string }] where A: Record

builtin count : (<-tables: [A], ?column: string) => [{ _value: int }] where A: Record
// A `map(fn: fn(r) => r.mycolumn)` before `count` would let us pick the column
builtin count : (<-tables: [A]) => [{ _value: int }]

builtin distinct : (<-tables: [{ A with 'column: B }], ?column: 'column) => [{ _value: B }] where A: Record, B: Equatable
// into
builtin distinct : (<-tables: [A]) => [B] where A: Equatable

// Can't write out a type without polymorphic labels as `A` is preserved in the output,
// but users can write it manually as `map(fn: fn(r) => ({ r with as: r.column }))` (vs `duplicate(column: "column", as: "as")`)
builtin duplicate : (<-tables: [{ A with 'column: B }], column: 'column, as: 'as) => [{ A with 'column: B, 'as: B }]
    where A: Record

// Same as `columns` and `count`. `map` before and after can "rename" the columns to and from the format we want
builtin elapsed : (<-tables: [{ A with 'time: time }], ?unit: duration, ?timeColumn: 'time, ?columnName: 'column) => [{ A with 'time: time, 'column: duration }]
    where A: Record
```

I didn't go through all of the functions thoroughly, but other than issues implementing these changed functions, it seems pretty plausible to "only" provide functions that operate on streams of plain values instead of records (or on records with specific fields, using map to go between).

Lists of polymorphic labels (which is not part of this PR) seem more complicated but at least some operations could be defined.

```
// map(fn: fn(r) => ({ column_1: r.column_1, column_2: r.column_2 })) // etc.
builtin keep : (<-tables: [{ A with ['column]: C }], ?columns: ['column], ?fn: (column: string) => bool) => [B] where A: Record, B: Record

// Can't be done
builtin drop : (<-tables: [{ A with ['column]: B }], ?fn: (column: string) => bool, ?columns: ['column]) => [{ C with  }]

builtin covariance : (<-tables: [{ A with ['column]: B }], ?pearsonr: bool, ?valueDst: 'dst, columns: ['column]) => [{ A with 'dst: B }]
    where A: Record
// Specify two specific fields in the signature and `map` to them
builtin covariance : (<-tables: [{ A with column1: B, column2: B }], ?pearsonr: bool) => [{ A with dst: B }]
    where A: Record

// Can't be translated, we rely on the other fields (`A`) to be preserved
builtin sort : (<-tables: [{ A with ['column]: B }], ?columns: ['column], ?desc: bool) => [{ A with ['column]: B }]
    where A: Record,
          B: Comparable
```


wolffcm commented Mar 17, 2022

@Marwes

In general I like the idea of using smaller composable operations, and especially map since we are making it more performant now. I have some concerns though.

For this example with aggregate functions:

```
builtin count : (<-tables: [A], ?column: string) => [{ _value: int }] where A: Record
// A `map(fn: fn(r) => r.mycolumn)` before `count` would let us pick the column
builtin count : (<-tables: [A]) => [{ _value: int }]
```

How does this work for a typical example where there are group key columns? So say something like

```
import "array"

array.from(rows: [
    {host: "A", region: "east", mem: 100},
    {host: "B", region: "east", mem: 100},
    {host: "C", region: "west", mem: 100},
    {host: "D", region: "west", mem: 100},
])
    |> group(columns: ["region"])
    |> sum()
```

For this example I want to get the sum of memory used by region. How do we preserve the separation between regions if aggregates always produce [{_value: int}]?

A broader question, are you thinking that users will not have to change their scripts in order to get the same behavior as before? Or are you assuming that some changes will be required of users?


Marwes commented Mar 18, 2022

For this example I want to get the sum of memory used by region. How do we preserve the separation between regions if aggregates always produce [{_value: int}]?

True, aggregators need to take groups into account, and the group concept is sort of shared between records and streams: the stream tracks each group, but the group key of a record is accessed through the record itself. So returning or accepting streams of raw values is probably not that useful.

I suppose there could be some way to try to separate group keys from records, but that likely strays too far towards breaking changes, and it may just end up replicating what records already do. So I think we need to stick with streams of records, which in turn forces aggregate functions (and other transformations) to accept which column(s) to work with as a parameter.

A broader question, are you thinking that users will not have to change their scripts in order to get the same behavior as before? Or are you assuming that some changes will be required of users?

No, users would be able to keep everything as is, but when working with pivoted data they would need to add a map to select the column to work with. However, given that aggregators expect groups to be preserved, we would need to do something like map(fn: fn(r) => { r with value: r.mycolumn }) to preserve the group key (I think), which doesn't seem great.

@Marwes Marwes force-pushed the labels branch 6 times, most recently from c70fb56 to 848cd8e Compare March 31, 2022 16:41
Marwes pushed a commit that referenced this pull request Apr 11, 2022
For #4388 we need a way to describe a type (label) variable in place of a record field.
As type variables are described as a single, uppercase letter, mimicking that syntax will make it
impossible to describe the literal field name "A", "B", etc. By allowing a string literal in this location one can always use this syntax to describe these single letter field names and also describe any other fields which do not fit into an `identifier`, for example `"field with spaces"`, `"symbols#%^"`.
Marwes pushed a commit that referenced this pull request Apr 11, 2022
Marwes pushed a commit that referenced this pull request Apr 12, 2022
@Marwes Marwes force-pushed the labels branch 3 times, most recently from cddf3bf to 98a3984 Compare April 14, 2022 12:36
Marwes pushed a commit that referenced this pull request Apr 20, 2022
@Marwes Marwes force-pushed the labels branch 3 times, most recently from 78e85ac to 70914d2 Compare April 20, 2022 13:08
Markus Westerlind and others added 17 commits April 28, 2022 13:10
Co-authored-by: Scott Anderson <[email protected]>
<error> isn't the best in this error message but it isn't worse than the previous error
These were causing errors due to the `experimental/universe` imports: accessing, say, `universe.fill` would instantiate all the functions in the `universe` record, but only the labels of the used function would get filled in, which then triggered the error.

I thought this was necessary to prevent ambiguities from arising when doing unifications
like

```
{ A with B: string } <=> { C with test: string }
{ A with B: string } <=> { D with abc: string }
```

as unifying the labels here (`B <=> "test"/"abc"`) would give different results depending on which unification was done first. However, I was able to remove that code, so record labels that never become concrete just hang around unused, which at least does not cause any problems.
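The order dependence mentioned in the commit message can be made concrete with a toy sketch (hypothetical, not the actual unifier): unifying the label variable `B` against two records commits it to whichever field is tried first, so the two orders yield different commitments.

```python
# Toy sketch (hypothetical): unifying a label variable against a record with a
# free row variable commits the label to some field of that record, so the
# outcome depends on which unification happens first.
def unify_label(var, record_fields, bindings):
    if var in bindings:
        if bindings[var] not in record_fields:
            # Already committed elsewhere; this record lacks the field, so its
            # row variable would have to absorb it instead.
            return f"row absorbs {bindings[var]!r}"
        return f"matched {bindings[var]!r}"
    bindings[var] = next(iter(sorted(record_fields)))  # commit to a field
    return f"committed to {bindings[var]!r}"

order1 = {}
print(unify_label("B", {"test"}, order1))  # committed to 'test'
print(unify_label("B", {"abc"}, order1))   # row absorbs 'test'

order2 = {}
print(unify_label("B", {"abc"}, order2))   # committed to 'abc'
print(unify_label("B", {"test"}, order2))  # row absorbs 'abc'
```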
5 participants